Intra-regional classification of Codonopsis Radix produced in Gansu province (China) by multi-elemental analysis and chemometric tools

Multi-elemental analysis is widely used to identify the geographical origins of plants. The purpose of this study was to explore the feasibility of combining chemometrics with multi-elemental analysis to classify Codonopsis Radix from different producing regions of Gansu province (China). A total of 117 Codonopsis Radix samples were collected from 7 counties of Gansu province. Inductively coupled plasma mass spectrometry (ICP-MS) was used to determine 28 elements (39K, 24Mg, 44Ca, 27Al, 137Ba, 57Fe, 23Na, 88Sr, 55Mn, 66Zn, 65Cu, 85Rb, 61Ni, 53Cr, 51V, 7Li, 208Pb, 59Co, 75As, 133Cs, 71Ga, 77Se, 205Tl, 114Cd, 238U, 107Ag, 9Be and 202Hg). Among the macro elements, 39K showed the highest level, whereas 23Na had the lowest content. The micro elements followed the concentration order 88Sr > 55Mn > 66Zn > 85Rb > 65Cu. Among the trace elements, 53Cr and 61Ni showed relatively high content, and 9Be was not detected in any sample. Intra-regional differentiation was performed by principal component analysis (PCA), cluster analysis (CA) and supervised learning algorithms, namely linear discriminant analysis (LDA), k-nearest neighbors (k-NN), support vector machines (SVM) and random forests (RF). Among them, the RF model performed best, with an accuracy of 78.79%. Multi-elemental analysis combined with RF was a reliable method for identifying the origins of Codonopsis Radix in Gansu province.


Results and discussion
Validation of analytical methods. The ICP-MS method was validated in terms of linearity, sensitivity, accuracy and precision (see Supplementary Table S1 online). The observed R² values ranged from 0.9931 to 1.0000 and the experimental F values were greater than the tabulated critical F, indicating good linearity of the analytical curves. Sensitivity was expressed as the limits of detection (LOD) and quantification (LOQ), which were calculated according to the latest IUPAC recommendations [19], taking α and β (the probabilities of type I and type II errors, respectively) at the default value of 0.05. Except for 9Be and 107Ag, the LOQ values of the elements were all lower than their natural levels in the Codonopsis Radix samples. Since no certified reference material is available for Codonopsis Radix, the accuracy of the method was determined by analyzing Codonopsis Radix samples (n = 3) fortified at their native levels before microwave digestion. The average recovery of most elements was within the acceptable range (85-115%), which indicates that losses during digestion were negligible. The relative standard deviation (RSD) values for the elements were less than 10%. These quality parameters therefore confirmed that the method meets the standards required of analytical methods.
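The two acceptance checks described above (mean spike recovery within 85-115% and RSD below 10%) can be illustrated with a minimal sketch. The numerical values and the function names `recovery_percent` and `rsd_percent` are hypothetical and are not taken from the study; the sketch also assumes that recovery is computed as the measured increase divided by the amount added.

```python
from statistics import mean, stdev

def recovery_percent(spiked, unspiked, added):
    """Spike recovery (%): measured increase relative to the amount added."""
    return (spiked - unspiked) / added * 100.0

def rsd_percent(replicates):
    """Relative standard deviation (%): sample standard deviation over the mean."""
    return stdev(replicates) / mean(replicates) * 100.0

# Hypothetical values (ug/g) for one element; not data from the study.
unspiked = 5.4                           # native level in the unfortified sample
added = 5.4                              # amount added (fortified at the native level)
spiked_replicates = [10.5, 10.9, 10.7]   # n = 3 fortified samples

recoveries = [recovery_percent(s, unspiked, added) for s in spiked_replicates]
mean_recovery = mean(recoveries)
rsd = rsd_percent(spiked_replicates)

recovery_ok = 85.0 <= mean_recovery <= 115.0   # acceptance range quoted in the text
rsd_ok = rsd < 10.0                            # precision criterion quoted in the text
print(f"mean recovery = {mean_recovery:.1f}%, RSD = {rsd:.2f}%")
```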
Analysis of macro elements. The elements analyzed in the present study were classified as macro (≥ 100 μg/g), micro (10-100 μg/g) or trace (< 10 μg/g) elements according to their content. The content of the 7 macro elements in the 117 Codonopsis Radix samples from different counties is summarized in Table 1. The average content of the macro elements decreased in the order 39K > 24Mg > 27Al > 44Ca > 57Fe > 137Ba > 23Na, a trend similar to that reported by Bai et al. [20]. Among them, K, Ca, Na, Mg and Fe are essential for normal human metabolism and are important for osmotic pressure balance, enzymatic reactions and hematopoiesis [21,22]. The mean levels were 11,000 μg/g for 39K, 1600 μg/g for 24Mg, 370 μg/g for 44Ca, 240 μg/g for 57Fe and 170 μg/g for 23Na. Thus, Codonopsis Radix could be used as a dietary supplement to provide the body with important mineral elements, especially K. Surprisingly, the K content of Codonopsis Radix is far higher than that of some Chinese herbal medicines such as Rhizoma Coptidis [23] and Ephedrae herba [16], and even higher than that of some K-rich fruits such as figs [24]. Furthermore, comparison with the report of Sun et al. showed that the content of Mg and Fe in Chinese Angelica from Min county is much higher than that in Codonopsis Radix, whereas the content of Na is lower [25]. This suggests that different medicinal materials may differ in their capacity to absorb metal elements from the same soil. Al and Ba are not essential elements for the human body, and excessive intake is harmful to the nervous system and kidneys [26]. The mean content of 27Al and 137Ba in all Codonopsis Radix samples was 400 and 207 μg/g, respectively, which was higher than that of some reported traditional Chinese medicines, such as Atractylodes macrocephala Koidz [17] and Rhizoma Coptidis [23].
In addition, the highest 39K, 44Ca, 24Mg, 57Fe, 27Al and 137Ba contents were all found in the samples from Wen county, which may be related to the abundance of mineral elements in the local soil.
[...] [27]. Mn plays a key role in the immune system and is also considered a potent antioxidant [28]. The mean content of 55Mn and 66Zn was 26 and 21 μg/g, respectively. The Mn content was lower than the mean Mn content of 39 traditional Chinese medicines reported by Gyamfi, whereas the Zn content was slightly higher [29]. The highest 55Mn level was found in samples from Wen county and the highest 66Zn level in samples from Zhang county. Within a certain concentration range, Cu is an essential element that is associated with the prosthetic groups of various enzymes and participates in key redox reactions. However, excess Cu in the body causes neurodegenerative diseases and impaired liver function [30]. The permissible limit set by the Green Trade Standard of Importing and Exporting Medicinal Plants and Preparations (WM-T2-2004) is 20 μg/g. The 65Cu content of the 117 Codonopsis Radix samples ranged from 2.2 to 9.6 μg/g, with an average of 5.4 μg/g. The Cu content of all samples in this study was within the permissible limit, and the highest content was found in samples from Wen county. Sr and Rb are not essential elements; the mean levels of 88Sr and 85Rb in the analyzed Codonopsis Radix samples were 48 and 6.8 μg/g, respectively. The highest 88Sr level was found in samples from Wen county and the highest 85Rb level in samples from Tanchang county. The Sr content was much higher than that of Rhizoma Coptidis [23] and Chinese Angelica [25] in China, and lower than that of some medicinal herbs in Turkey [14].
Analysis of trace elements. The content of the 15 trace elements in the 117 Codonopsis Radix samples from different counties is given in Table 3. These include essential trace elements such as Cr, Ni, Se, V and Co, which are nutrients that act as cofactors in metabolism and other biological processes. For example, chromium is important in the utilisation of glucose [31]. Selenium is an important component of the enzyme glutathione peroxidase [32]. Nickel is believed to act as a cofactor for iron absorption from the intestine during physiological processes [33]. Vanadium is an enzyme cofactor in hormone, glucose, lipid, bone and tooth metabolism [34]. Cobalt is an important component of vitamin B12 [35]. The average content of 53Cr, 61Ni, 51V, 77Se and 59Co was 1.2, 1.0, 0.7, 0.2 and 0.1 μg/g, respectively. The highest 61Ni, 51V, 59Co and 77Se contents were all found in the samples from Wen county, and the highest 53Cr content was found in samples from Min county. The contents of the non-toxic trace elements 7Li, 71Ga, 133Cs and 107Ag were also determined and their average contents were 0.3, 0.1, 0.09 and 0.002 [...]
Table 2. Macro element content of Codonopsis Radix samples according to their geographical origin (content is expressed as μg/g on a dry matter basis).
[...] 75As and 202Hg content below the permissible limit. However, the content of 114Cd in 4 samples from Wen county exceeded the permissible limit. The content of 208Pb, 75As, 114Cd and 202Hg in the analyzed Codonopsis Radix from the Gansu production areas was generally lower than the content of these heavy metals in other herbs, such as Argy Wormwood Leaf, Morinda Root, Zedoary Rhizome and Indian Madder Root [37], which might be related to the soil, the climate and the lower heavy metal pollution of the planting areas. The content of Cr, As and Pb was lower than the values reported by Kong [38].
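The content-based grouping used in the element analyses above (macro ≥ 100 μg/g, micro 10-100 μg/g, trace < 10 μg/g) can be written as a simple threshold rule. The sketch below applies it to mean contents quoted in the text; the function name `classify_by_content` is hypothetical, and assigning a value of exactly 10 μg/g to the micro group is an assumption, since the text does not state how that boundary is handled.

```python
def classify_by_content(mean_content_ug_g):
    """Group an element as macro, micro or trace from its mean content (ug/g)."""
    if mean_content_ug_g >= 100:
        return "macro"
    if mean_content_ug_g >= 10:   # assumption: exactly 10 ug/g counts as micro
        return "micro"
    return "trace"

# Mean contents (ug/g) quoted in the text for a few elements.
mean_content = {
    "39K": 11000, "24Mg": 1600, "44Ca": 370,
    "88Sr": 48, "55Mn": 26, "66Zn": 21,
    "53Cr": 1.2, "59Co": 0.1,
}

groups = {element: classify_by_content(value) for element, value in mean_content.items()}
for element, group in groups.items():
    print(element, group)
```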

Analysis of variance.
A Kruskal-Wallis test with Bonferroni correction was performed as an initial screen for elements whose content differed significantly (p < 0.05) among Codonopsis Radix samples from the seven counties of Gansu province. The p-values showed that the content of all 27 elements (9Be was not detected) differed significantly. Therefore, every element contributed to the provenance classification for one or more pairs of origins (see Supplementary Table S2 online). The box plots depict the content profiles of the key elements with higher significance in Codonopsis Radix from different counties (see Supplementary Fig. S1 online). For 55Mn, 44Ca, 24Mg, 208Pb and 238U, the highest levels were observed in samples from Wen county. Levels of 71Ga in samples from Wen and Tanchang counties were significantly higher than in the other counties. 66Zn showed higher levels in samples from Zhang county. The content of 55Mn and 208Pb was very low in samples from Zhang county. The content of 88Sr was low in samples from Min, Tanchang and Zhang counties. 85Rb showed lower levels in samples from Min county.
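As an illustration of the test described above, the following sketch computes the Kruskal-Wallis H statistic (with tie correction), an approximate p-value from the chi-square distribution, and a Bonferroni-adjusted p-value. It is not the study's R workflow. The helper names, the toy data and the number of comparisons are hypothetical, and the closed-form chi-square tail used here is valid only for an even number of degrees of freedom (for seven counties, df = 6).

```python
import math

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction for a list of groups."""
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average rank shared by tied values
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    rank_sums = [0.0] * len(groups)
    for (_, g), r in zip(pooled, ranks):
        rank_sums[g] += r
    h = 12 / (n * (n + 1)) * sum(rs**2 / len(grp) for rs, grp in zip(rank_sums, groups)) - 3 * (n + 1)
    return h / (1 - tie_term / (n**3 - n))

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half, term, total = x / 2, 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)

# Hypothetical log10 contents of one element in three origins (not study data).
toy_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat = kruskal_wallis_h(toy_groups)
p_value = chi2_sf_even_df(h_stat, df=len(toy_groups) - 1)
p_adjusted = bonferroni(p_value, n_comparisons=3)   # hypothetical number of comparisons
print(h_stat, p_value, p_adjusted)
```

The chi-square approximation is rough for groups this small; the toy data only keep the arithmetic easy to follow.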

Principal component analysis (PCA).
To obtain an overview of the data set and visualize the differences between Codonopsis Radix samples from different counties, PCA was first performed with all 27 elements that showed significant differences (p < 0.05). The information retained by a given number of principal components (PCs) was expressed as the cumulative percentage of the total variance. The first two PCs explained 56.1% of the total variance (46.3% for PC1 and 9.8% for PC2). Figure 1 shows the PC1-PC2 score and loading plots. As shown in Fig. 1a, the scores of Codonopsis Radix samples from different origins overlapped considerably. Nevertheless, samples from Wen county had positive scores on PC1 and were distinguished from samples from Zhang and Min counties, which had negative scores on PC1.
As shown in Fig. 1b [...] Co. Negative scores on PC1 indicated a lower content of these elements in the corresponding Codonopsis Radix samples. On the other hand, 88Sr, 23Na, 61Ni and 53Cr were the main variables of PC2. Positive scores corresponded to a higher content of 88Sr and 23Na, and negative scores to a higher content of 61Ni and 53Cr.
[...] 75As, 71Ga and 77Se elements, which indicated that the respective content profiles were quite similar (see Supplementary Fig. S2 online). To avoid redundant data, the heat map was recalculated with only one of each set of highly correlated elements. As shown in Fig. 2, the horizontal dendrogram of the Codonopsis Radix samples showed a completely separate cluster of samples from Wen county. Samples from the other counties did not form separate clusters. Indications of the key elements for provenance differentiation could be extracted from the content profiles. The samples from Wen county displayed a high content not only of essential elements such as 44Ca, 24Mg and 39K, but also of toxic elements such as 208Pb and 114Cd.
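The step of keeping only one of each set of highly correlated elements before recalculating the heat map could be sketched as a greedy filter on the Pearson correlation. The correlation threshold of 0.9, the function names and the toy values below are assumptions for illustration; the text does not state which cutoff or procedure was actually used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def drop_correlated(columns, threshold=0.9):
    """Keep an element only if |r| with every already kept element is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical log10 contents for four samples (not study data).
toy_columns = {
    "75As": [1.0, 2.0, 3.0, 4.0],
    "71Ga": [2.0, 4.0, 6.0, 8.1],   # nearly proportional to 75As, so it is dropped
    "23Na": [4.0, 1.0, 3.0, 2.0],
}
kept_elements = drop_correlated(toy_columns)
print(kept_elements)
```

Because the filter is greedy, the result depends on the order of the columns: the first element of a correlated set is the one retained.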

Linear discriminant analysis (LDA).
In order to categorize the Codonopsis Radix samples according to their geographical origins, a stepwise LDA was carried out. In the first step, the whole data set was explored to obtain a broad separation of the samples. As shown in Fig. 3a, the first two canonical discriminant functions (DFs) explained 64.74% of the variance. The plot showed that the samples from Wen county formed a distinct, independent group, while samples from Lintao, Weiyuan, Longxi, Min, Zhang and Tanchang counties clustered together with indistinct separation. To resolve the samples lying close together, the elemental content of these six counties was analyzed separately. As shown in Fig. 3b, samples from Min and Zhang counties were clearly separated, whereas samples from Lintao, Weiyuan, Longxi and Tanchang clustered together. Taking this approach one step further, samples from Lintao, Weiyuan, Longxi and Tanchang counties formed separate groups, although with unclear separation (Fig. 3c).

Statistical classification analysis.
To carry out a predictive classification analysis, the data matrix was randomly divided into a training set (70% of the objects of the whole data matrix) and a test set (30%). The training set, with known class membership, was used to build the classifier. The test set contained objects not included in the training set, also with known class membership, and was used to verify the model.
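A random 70/30 division of this kind can be sketched as follows. The text states only that the split was random; splitting within each county (stratification), the fixed seed and the per-county sample counts used here are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(train_fraction * len(indices))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)

# Hypothetical per-county counts that sum to 117; the real counts are not given here.
counts = {"Wen": 17, "Min": 17, "Zhang": 17, "Tanchang": 17,
          "Weiyuan": 17, "Lintao": 16, "Longxi": 16}
labels = [county for county, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```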
In this work, four chemometric models, namely LDA, k-NN, SVM and RF, were selected and tested for classifying Codonopsis Radix samples according to their geographical origin. For each method, the relevant parameters were optimized, a model was built, and the model was then evaluated as a predictive tool. Each model was trained using k-fold cross-validation on the training set to build different classifiers. This process was repeated n times, so that each subset was tested at least once. The number of neighbors k for k-NN, the number of variables evaluated at each split (mtry) and the number of trees (ntree) for RF, and the penalty factor C and the ε of the ε-insensitive loss function for SVM were selected by ten-fold cross-validation repeated five times, choosing the values that gave the maximum accuracy.
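The tuning scheme described above (ten-fold cross-validation repeated five times, keeping the parameter value with the highest mean accuracy) can be sketched generically. The names `repeated_kfold`, `select_by_cv` and `toy_evaluate` are hypothetical, and `toy_evaluate` is only a stand-in that returns made-up accuracies; in practice it would fit a model on the training indices and score it on the held-out fold.

```python
import random
from statistics import mean

def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::n_folds] for i in range(n_folds)]
        for i, test_indices in enumerate(folds):
            train_indices = [s for j, fold in enumerate(folds) if j != i for s in fold]
            yield train_indices, test_indices

def select_by_cv(candidates, n_samples, evaluate, n_folds=10, n_repeats=5, seed=0):
    """Return (best_candidate, scores) using the mean accuracy over all splits."""
    scores = {}
    for candidate in candidates:
        scores[candidate] = mean(
            evaluate(train, test, candidate)
            for train, test in repeated_kfold(n_samples, n_folds, n_repeats, seed)
        )
    return max(scores, key=scores.get), scores

def toy_evaluate(train_indices, test_indices, k):
    """Hypothetical stand-in: made-up accuracies for three k-NN settings."""
    return {1: 0.80, 3: 0.75, 5: 0.70}[k]

splits = list(repeated_kfold(82))   # 82 training samples is an assumed size
best_k, cv_scores = select_by_cv([1, 3, 5], n_samples=82, evaluate=toy_evaluate)
print(len(splits), best_k)
```

Within each repeat, every sample appears in exactly one test fold, which is the property the text refers to when it says that each subset is tested.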
Once the best parameter values for each model had been selected, the sensitivity (the proportion of samples belonging to a category that were correctly classified in that category), specificity (the proportion of samples not belonging to the modeled category that were correctly classified as not belonging to it) and the mean accuracy were used to evaluate the classification achieved by the chemometric methods. The performance of the different classification methods is shown in Table 4.
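The three evaluation measures defined above can be computed per class in a one-versus-rest fashion. The sketch below uses hypothetical true and predicted labels, not results from the study, and the function name `per_class_metrics` is likewise illustrative.

```python
def per_class_metrics(y_true, y_pred):
    """Return ({class: (sensitivity, specificity)}, overall_accuracy)."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)   # in class, predicted in class
        fn = sum(t == c and p != c for t, p in pairs)   # in class, predicted elsewhere
        tn = sum(t != c and p != c for t, p in pairs)   # not in class, not predicted in class
        fp = sum(t != c and p == c for t, p in pairs)   # not in class, predicted in class
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")
        specificity = tn / (tn + fp) if tn + fp else float("nan")
        metrics[c] = (sensitivity, specificity)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return metrics, accuracy

# Hypothetical test-set labels for three origins.
y_true = ["Wen", "Wen", "Min", "Min", "Zhang", "Zhang"]
y_pred = ["Wen", "Wen", "Min", "Zhang", "Zhang", "Zhang"]

class_metrics, accuracy = per_class_metrics(y_true, y_pred)
for origin, (sens, spec) in class_metrics.items():
    print(f"{origin}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
print(f"accuracy = {accuracy:.4f}")
```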
Table 4 shows that the four chemometric methods achieved different degrees of success in predicting the test samples. The recognition rates decreased in the order RF > SVM > LDA > k-NN. RF showed the best performance in distinguishing the Codonopsis Radix samples by geographical origin, with a total classification accuracy of 78.79%. Great results were acquired [...] However, Codonopsis Radix from Weiyuan, Lintao and Longxi could not be predicted well, which might be due to the close proximity of these three production areas and their similar natural conditions and soil types.
In sum, the ICP-MS-based elemental analysis chosen in this study entails higher acquisition and maintenance costs than other elemental techniques. On the other hand, it can determine multiple elements simultaneously with low detection limits. In addition, the ICP-MS investigation did not reveal considerable disadvantages in terms of analysis time or cost compared with other established identification methods. Therefore, ICP-MS combined with chemometric analysis is a powerful tool for the origin traceability and identification of medicinal materials, which is consistent with the results reported in many studies [39-41]. In the current study, differentiation by elemental composition was found to be reliable and satisfactory for Codonopsis Radix collected from 7 counties in Gansu province. Among all the statistical methods tested, RF proved to be the most successful for differentiating the origins of the analyzed samples. Our results might provide a new strategy for the origin traceability of Chinese herbal medicines. It should be emphasized that the 117 Codonopsis Radix samples in this study were all harvested at their optimum harvesting period. Except that the samples from Wen county had been grown for more years, the samples from the other six producing areas had the same number of growth years.
More samples need to be collected for further analysis to explore whether the growth period and harvest period affect the content of elements in Codonopsis.

Materials and methods
Reagents. Suprapure nitric acid (65%) from Merck (Darmstadt, Germany) was used to digest the samples. Ultrapure water (18.2 MΩ cm resistivity at 25 °C) was obtained from a Milli-Q water purification system (Millipore, Germany). Certified multi-element standard solutions (39K, 24Mg, 44Ca, 27Al, 137Ba, 57Fe, 23 [...]
Sample pretreatment and digestion. The dried Codonopsis Radix (water content ≤ 16.0%) was baked in an oven at 60 °C for 2 h, ground using a pulverizer and stored in plastic bags. About 300 mg of each sample was accurately weighed into a PTFE digestion vessel. Then 3 mL of concentrated HNO3 and 1 mL of concentrated H2O2 were added, and the vessel was left to stand for about 20 min before being closed. Digestion of Codonopsis Radix was performed using a MARS microwave-assisted digestion system (CEM, United Kingdom). The digestion procedure was as follows: (1) 900 W at 110 °C for 10 min; (2) [...]
ICP-MS analysis. ICP-MS analysis was carried out on an Agilent 7900 instrument (Agilent Technologies, Santa Clara, CA, USA). 28 elements (39K, 24Mg, 44Ca, 27Al, 137Ba, 57Fe, 23Na, 88Sr, 55Mn, 66Zn, 65Cu, 85Rb, 61Ni, 53Cr, 51V, 7Li, 208Pb, 59Co, 75As, 133Cs, 71Ga, 77Se, 205Tl, 114Cd, 238U, 107Ag, 202Hg and 9Be) were determined. The [...]
Table 4. Discrimination results obtained with the different chemometric models. Column headers: Groups; Number of samples; Train set; Test set; Sensitivity (%) and Specificity (%) for each of LDA, RF (ntree = 1000, mtry = 16) a, SVM (C = 10, ε = 0.01) b and k-NN (k = 1) c. a ntree, number of trees; mtry, number of variables tried at each split. b C, penalty factor; ε, parameter of the ε-insensitive loss function. c k, number of neighbors.
Statistical analysis. First, the concentrations of all elements were compared with the LOD and LOQ. 9Be was not detected in any sample. Some 107Ag concentrations were lower than the LOD, which meant that 107Ag was not detectable in some Codonopsis Radix samples. To avoid difficulties in applying the logarithmic function, 107Ag concentrations below the LOD were set to the LOD instead of zero. The three replicates of each sample were averaged, and the base-10 logarithm (log10) of the average content of each element was used for data analysis. The data matrix for chemometric analysis consisted of 27 columns (Be was not detected) and 117 rows; the columns represented the content of the elements and the rows corresponded to the samples. Analysis of variance (ANOVA) was used to compare the elemental content of the samples, with p < 0.05 as the significance level. Principal component analysis (PCA) was carried out to reduce dimensionality and to visualize the data set. Hierarchical cluster analysis (HCA) based on Pearson correlation was performed to detect feature similarities in the data set.
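The data preparation described in the statistical analysis (values below the LOD set to the LOD, replicates averaged, then log10-transformed) can be sketched for a single sample as follows. The LOD values and concentrations are hypothetical, the function name `preprocess_sample` is illustrative, and applying the LOD substitution to each replicate before averaging is an assumption, since the text does not specify the order of these two steps.

```python
import math
from statistics import mean

def preprocess_sample(replicates, lod):
    """Return {element: log10(mean content)} for one sample.

    replicates: {element: [replicate concentrations in ug/g]}
    lod:        {element: limit of detection in ug/g}
    """
    row = {}
    for element, values in replicates.items():
        # Assumption: values below the LOD are replaced before averaging.
        floored = [max(value, lod[element]) for value in values]
        row[element] = math.log10(mean(floored))
    return row

# Hypothetical triplicate measurements and LODs (not study data).
sample = {"107Ag": [0.0, 0.003, 0.002], "39K": [10900.0, 11000.0, 11100.0]}
lod = {"107Ag": 0.001, "39K": 1.0}

row = preprocess_sample(sample, lod)
print(row)
```

Stacking one such row per sample would give the 117 × 27 matrix described in the text.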
Four chemometric methods were used to build and evaluate models for classifying Codonopsis Radix grown in Gansu province according to origin: linear discriminant analysis (LDA), k-nearest neighbors (k-NN), support vector machine (SVM) and random forest (RF). LDA is a classification method designed to maximize the ratio of the between-class variance to the within-class variance in order to achieve maximum separability. The decision boundary created by LDA is called a discriminant function, and is the linear combination of the variables that best distinguishes the categories [42]. k-NN is a classification technique that uses the Euclidean distance to find the k samples (neighbors) closest to the test sample in the feature space, and then assigns to the test sample the category label that appears most frequently among those neighbors [43]. SVM is a powerful method for building classifiers. It aims to create a decision boundary between two classes, so that labels can be predicted from one or more feature vectors [44]. RF consists of a large number of individual decision trees that operate as an ensemble. Each tree in the random forest outputs a class prediction, and the class with the most votes becomes the model's prediction [45]. All basic statistical and multivariate analyses were performed with R software (version 3.6.3).
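As a minimal illustration of the k-NN rule described above (Euclidean distance, majority vote among the k nearest training samples), the following sketch classifies a new sample from two hypothetical features. It is not the R implementation used in the study; the function name `knn_predict`, the feature values and the origin labels are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=1):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    # Sort training samples by Euclidean distance to x.
    neighbors = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in neighbors[:k])
    # On a tied vote, the label encountered first (the nearer neighbor) is returned.
    return votes.most_common(1)[0][0]

# Hypothetical log10 contents of two elements for four training samples.
train_X = [(4.0, 3.2), (4.1, 3.1), (3.2, 2.5), (3.3, 2.6)]
train_y = ["Wen", "Wen", "Min", "Min"]

prediction_a = knn_predict(train_X, train_y, (4.05, 3.15), k=1)
prediction_b = knn_predict(train_X, train_y, (3.25, 2.55), k=3)
print(prediction_a, prediction_b)
```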

Conclusion
The elemental composition profile of macro, micro and trace elements was assessed in detail for the first time in Codonopsis Radix samples collected from seven counties of Gansu province. Among the macro elements, 39K had the highest and 23Na the lowest content. Among the micro elements, the content order was 88Sr > 55Mn > 66Zn > 85Rb > 65Cu. Among the trace elements, 53Cr and 61Ni showed higher content. The contents of the toxic trace elements 208Pb, 75As and 202Hg were below the permissible limits in all samples, and 114Cd was also below the permissible limit except in 4 samples from Wenxian. Statistical analysis of the data using RF successfully classified the Codonopsis Radix samples from the Wenxian, Minxian, Zhangxian, Tanchang and Weiyuan (or Lintao or Longxi) production areas. The optimal parameters of this model were ntree = 1000 and mtry = 16.